---
title: 'Replication for "When Correlation Is Not Enough: Validating Populism Scores from Supervised Machine-Learning Models" (by Jankowski/Huber; accepted for publication at Political Analysis)'
author: Michael Jankowski, Robert A. Huber
output:
  rmdformats::robobook:
    use_bookdown: true
    self_contained: true
    thumbnails: false
    lightbox: true
    gallery: false
    highlight: monochrome
editor_options:
  chunk_output_type: console
---

```{r globopt, include = F}
knitr::opts_chunk$set(warning = FALSE)
```

# Hardware

All code was run on a computer with an Intel i7-9750H CPU (6 cores, 12 threads) and 16 GB RAM under Windows 10.

# Software

Please install `RStudio` and `R`. `R` should be version 4.1.x or later.
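
A quick way to confirm that the running `R` session meets this requirement is sketched below. The threshold `4.1.0` is our reading of the version requirement, not a value taken from the replication scripts.

```r
# Compare the running R version against the minimum assumed here (4.1.0).
r_ok <- getRversion() >= "4.1.0"

if (!r_ok) {
  warning("R ", as.character(getRversion()),
          " is older than 4.1.0; please update R.")
}

r_ok
```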

Please also install Anaconda (we used version 2.11; https://www.anaconda.com/products/distribution) to run the Python scripts. We used Python 3.9.7 and Jupyter Notebooks to execute the code. Note that a Python installation **is also necessary to run the `R` code**, as we rely on the `reticulate` package (which runs Python code from `R`).

# Runtime

Runtime for the `R` code is about 20 minutes, including the download of Di Cocco and Monechi's (DCM) replication data.

Runtime for the complete Python code is about 7 days using the hardware described above.

# R and Python Packages

Required R packages are described in more detail below.

Required Python packages are as follows:

- matplotlib (3.4.1)
- seaborn (0.11.1)
- scipy (1.6.2)
- numpy (1.20.1)
- json (2.0.9; part of the Python standard library)
- scikit-learn (0.23.1; imported as `sklearn`)
- pandas (1.2.3)
- shap (0.39.0)

# Hard drive space

The complete replication package requires around 15 GB of free hard drive space.

# Description

Please note that we focus on the part of the analysis in which we use `R`. *The supervised machine-learning models were run using Python*. The Python scripts were largely adapted from Di Cocco and Monechi's replication files. In most cases, we only added some lines of code. Please see the folder `python_files` for these scripts. Their README.txt file also provides information on the specific scripts. We ran the Python scripts using Anaconda (see above).

As the tables and figures reported in this notebook are based on analyses run in Python, we strongly advise running the Python code to verify the reproducibility of our findings. The reshuffling analyses take a long time: each country-reshuffling analysis takes several hours, and sometimes more than 24 hours, on the computer described above. Total runtime for the Python code on that computer is about 7 days. If you do not want to run the Python code, you can still reproduce the tables and figures with the `R` code, as the downloaded replication material includes all results of the Python models. However, you still need to have Python installed, as the `R` code sometimes calls Python code through the `reticulate` package.

# How to proceed

All steps required to replicate the results in `R` can be run from the file `master_replication.R`.

To replicate all tables and figures of our manuscript, you first need to make sure that all required `R` packages are installed and loaded. We provide `R` code that installs all required packages. After that, you need to extract ("untar") all replication data created by us. We also provide `R` code for this step. Finally, you also need to download the replication data from the paper by Di Cocco and Monechi (doi: <https://doi.org/10.1017/pan.2021.29>; replication material: <https://doi.org/10.7910/DVN/BMJYAN>; DCM). This can also be done using our `R` code.

After completing these steps, you can replicate all our results using Python and `R`. As noted above, if you do not want to run the Python models, you can replicate just the tables and figures with the `R` code.

# Prepare Replication Data

After downloading our replication material from Political Analysis' Harvard Dataverse, please open the file `JH_Replication_Political_Analysis.Rproj`. You need to have `R` (version 4.1.x or later) and `RStudio` installed for this. After opening the `.Rproj` file in `RStudio`, please open the file `master_replication.R` within the project. In the following sections, we describe each step of the `master_replication.R` file.

# Install/Load Packages

The following code installs all packages (if required) and then loads them. It also uses the `here` package to set the project root and installs the Python package `pandas` via `reticulate`. Runtime of this code depends on the number of packages that you have already installed. However, even if you have no packages installed, it should only take a few minutes.

```{r pkgload}
run_start <- Sys.time()

pkgs <- c("tidyverse",
          "data.table",
          "visreg",
          "xtable",
          "broom",
          "rio",
          "texreg",
          "reticulate",
          "xml2",
          "here",
          "dataverse",
          "rvest")

# Function to check whether packages are installed
# If not: the package is installed from CRAN and then loaded
# If so: the package is loaded

install_load <- function(packages){
  
  for (p in packages) {
    cat("Check package: '", p, "'...\n", sep = "")
    flush.console()
    
    if (p %in% rownames(installed.packages())) {
      
      cat("Package: '", p, "' is already installed...\n\n", sep = "")
      flush.console()
      
      library(p, character.only=TRUE)
      
    } else {
      
      cat("Package: '", p, "' is NOT installed! Will install now...\n\n", sep = "")
      install.packages(p)
      library(p,character.only = TRUE)
      
    }
  }
  cat("\nAll packages installed!\n\n")
}

# Apply function to all required packages

install_load(pkgs)

# Set wd with here() package

here::i_am("master_replication.R")

# Install relevant python packages:

py_install("pandas")

run_stop <- Sys.time()
run_time <- (run_stop - run_start)
run_time
```

# Untar data

Our replication data is packed in `.tar` files, which need to be extracted ("untarred"). This can be done by running the following code:

```{r untardata}
run_start <- Sys.time()

library(here)

all_tar_files <- list.files(here(),
                            pattern = "\\.tar$")

lapply(all_tar_files, function(x){
  
  untar(here(x))
  
})

run_stop <- Sys.time()
run_time <- (run_stop - run_start)
run_time

rm(list = ls())
```

After running this code, you should see the following folders in your working directory:

-   VDEM

-   reshuffled_data

-   R_Scripts

-   python_output

-   python_files

-   py_functions

-   POPPA

-   output
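
A minimal sketch for checking that the extraction produced these folders is shown below. The helper `check_folders()` and its `root` argument are hypothetical and not part of the replication scripts; the folder names are copied from the list above and are matched exactly, so letter case matters on case-sensitive file systems.

```r
# Hypothetical helper: reports which of the expected folders exist under `root`.
check_folders <- function(root = getwd()) {
  expected <- c("VDEM", "reshuffled_data", "R_Scripts", "python_output",
                "python_files", "py_functions", "POPPA", "output")

  present <- dir.exists(file.path(root, expected))
  names(present) <- expected

  absent <- expected[!present]
  if (length(absent) > 0) {
    message("Missing folders: ", paste(absent, collapse = ", "))
  } else {
    message("All expected folders found.")
  }

  invisible(present)
}

# Check the current working directory.
check_folders()
```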

# Download of Di Cocco and Monechi's Replication Data

Finally, the replication data from Di Cocco and Monechi's paper needs to be downloaded. This can also be done using `R`.

```{r download}
run_start <- Sys.time()

dir.create(here("dcm_dataverse"))

files <- list("00_generate_bag_of_words.ipynb" = "https://dataverse.harvard.edu/api/access/datafile/4756350",
           "01_train_all_models.py" = "https://dataverse.harvard.edu/api/access/datafile/4756365",
           "01_train_model.ipynb" = "https://dataverse.harvard.edu/api/access/datafile/4756351",
           "02_compute_scores.ipynb" = "https://dataverse.harvard.edu/api/access/datafile/4756352",
           "03_results.ipynb" = "https://dataverse.harvard.edu/api/access/datafile/4756353",
           "CHES_data.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756355",
           "GPD_data.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756357",
           "main_text_figures.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756358",
           "models.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756366",
           "POPPA_data.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756359",
           "README.txt" = "https://dataverse.harvard.edu/api/access/datafile/4756360",
           "SI_figures.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756362",
           "training_results.json" = "https://dataverse.harvard.edu/api/access/datafile/4756363",
           "train_configurations.py" = "https://dataverse.harvard.edu/api/access/datafile/4756364",
           "bow_and_labels.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756354",
           "datasets.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756356",
           "scores.tar.gz" = "https://dataverse.harvard.edu/api/access/datafile/4756361")

options(timeout=2000) # (increase time for download to 2000sec., default of R is 60sec.)

for(i in seq_along(files)){

  fname <- names(files)[i]
  flink <- files[[i]]

  cat("Download of file: ", fname," (",i," of ", length(files), " files)\n", sep = ""); flush.console()
  
  download.file(flink,
                destfile = here("dcm_dataverse", fname),
                mode = "wb")

}

cat("\n\nAll files downloaded!\n\nNow unzipping tar files.\n\n")

all_tar_files <- list.files(path = here("dcm_dataverse"),
                            pattern = "\\.tar\\.gz$",
                            full.names = TRUE)


lapply(all_tar_files, function(x){
  
  cat("Untar:", x, "\n\n"); flush.console()
  
  untar(x, 
        exdir = here("dcm_dataverse")) 
  
})

run_stop <- Sys.time()
run_time <- (run_stop - run_start)
run_time

cat("All files downloaded and extracted.\n\n")
```

The code specifies which files should be downloaded from the Dataverse, then downloads the files, and finally unpacks the `tar.gz` files. Please note: downloading the data takes some time (5-10 minutes) and requires about 14 GB of space on your hard drive. Download progress is reported.
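
Because a timed-out download can leave a missing or empty file behind, it may be worth checking the result before continuing. The sketch below uses a hypothetical helper, `check_downloads()`, that is not part of the replication scripts; it only relies on base `R`.

```r
# Hypothetical helper: flags expected files that are missing or empty.
check_downloads <- function(dir, fnames) {
  paths <- file.path(dir, fnames)
  sizes <- file.size(paths)  # NA for files that do not exist

  data.frame(file    = fnames,
             exists  = file.exists(paths),
             size_mb = round(sizes / 1024^2, 1),
             ok      = !is.na(sizes) & sizes > 0)
}

# Example call, assuming the `files` list from the download chunk is still in memory:
# status <- check_downloads(here::here("dcm_dataverse"), names(files))
# status[!status$ok, ]
```

Any row with `ok == FALSE` points to a file that should be downloaded again.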

# Replication of results

Now, all tables and figures can be replicated using `R`.

## Table 1: Most Important Features (mean impurity approach)

The file `Table1.R` contains the code for creating Table 1 of the manuscript. It also computes some proportions that we mention in the main text (the proportion of sentences containing party names in Germany). Please note that the tables are printed as `LaTeX` code, which we manually edited to adjust the look of the table.

```{r tab1, code = readLines(here("R_scripts","Table1.R"))}
```

## Figure 1 (Comparison: DCM/V-DEM Scores for Austria)

In Figure 1 we compare DCM's populism scores for Austria with V-DEM's populism scores for Austria. The code to create the figure can be found in `Figure1.R`.

```{r fig1, fig.width=8, fig.height=6, code = readLines(here("R_scripts","Figure1.R"))}
```

## Figure 2 (Reshuffling Results)

Figure 2 in the main text consists of Figures 2a and 2b. The code to create these figures is in `Figure2.R`. The code also creates some figures for the appendix (figures similar to 2a and 2b for the other countries; see Figures A19 - A28 in the appendix).

```{r fig2, fig.width=10, fig.height=5, code = readLines(here("R_scripts","Figure2.R"))}
```

## Figure 3 (Correlation of Reshuffling Scores)

Figure 3 of the main text displays the correlations of the reshuffled scores and DCM's scores with the POPPA and V-DEM populism scores. The code for this Figure can be found in `Figure3.R`.

```{r fig3, fig.width=10, fig.height=5, code = readLines(here("R_scripts","Figure3.R"))}
```

Please note that this file calls the scripts `reshuffle_corr_vdem.R` and `reshuffle_corr_poppa.R` which merge the reshuffled data and VDEM/POPPA data. The code of these scripts is displayed below.

```{r fig3a, eval = F, code = readLines(here("R_scripts","reshuffle_corr_vdem.R"))}
```

```{r fig3b, eval = F, code = readLines(here("R_scripts","reshuffle_corr_poppa.R"))}
```
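
The general pattern of merging text-based scores with expert scores and computing one correlation per reshuffling run can be sketched as follows. This is a toy illustration with simulated values and hypothetical column names (`party`, `run`, `score`, `expert_score`); it does not reproduce the scripts shown above or any result from the paper.

```r
set.seed(42)

# Simulated expert scores for six hypothetical parties.
expert <- data.frame(party = LETTERS[1:6],
                     expert_score = runif(6, min = 0, max = 10))

# Simulated text-based scores for three hypothetical reshuffling runs.
reshuffled <- expand.grid(party = LETTERS[1:6], run = 1:3,
                          stringsAsFactors = FALSE)
reshuffled$score <- runif(nrow(reshuffled))

# Merge on the party identifier, then compute one Pearson correlation per run.
merged <- merge(reshuffled, expert, by = "party")
cors <- sapply(split(merged, merged$run),
               function(df) cor(df$score, df$expert_score))
cors
```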

## Appendix A1 (POPPA vs. PopuList)

Appendix A1 shows the relationship between POPPA's populism score and the classification of populist parties from the PopuList. The code is stored in `AppendixA1.R`.

```{r figA1, fig.width=10, fig.height=5, code = readLines(here("R_scripts","AppendixA1.R"))}
```

## Appendix A2 (V-DEM and DCM Scores)

In Appendix A2 we run regression analyses to test the effect of V-DEM's temporal populism scores on DCM's populism scores. The code for this analysis is stored in `AppendixA2.R`. Note that we run fixed-effects (FE) regressions by including party dummies in the OLS model. This is equivalent to explicitly specifying an FE regression model.

```{r fereg, code = readLines(here("R_scripts","AppendixA2.R"))}
```
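
The equivalence between OLS with party dummies and an explicit FE (within) estimator can be illustrated with simulated data. The sketch below uses only base `R`; the variable names and data are hypothetical and unrelated to the data used in `AppendixA2.R`.

```r
set.seed(1)

# Simulated panel: 5 hypothetical parties with 20 observations each.
d <- data.frame(party = rep(letters[1:5], each = 20))
party_id <- as.integer(factor(d$party))
d$x <- rnorm(nrow(d)) + party_id                  # regressor correlated with party
d$y <- 2 * party_id + 0.5 * d$x + rnorm(nrow(d))  # party-specific intercepts

# (1) OLS with party dummies (least-squares dummy variables).
lsdv <- lm(y ~ x + factor(party), data = d)

# (2) Within estimator: demean y and x by party, regress without an intercept.
d$y_dm <- d$y - ave(d$y, d$party)
d$x_dm <- d$x - ave(d$x, d$party)
within_fit <- lm(y_dm ~ x_dm - 1, data = d)

# Both approaches yield the same slope, up to numerical precision.
same_slope <- all.equal(unname(coef(lsdv)["x"]),
                        unname(coef(within_fit)["x_dm"]))
same_slope
```

The slope estimates coincide. The standard errors reported by the second model would still need a degrees-of-freedom correction, because `lm()` does not know that the party means were estimated.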

## Appendix A3 (SHAP Values)

Appendix Section A3 displays the top-15 SHAP features for all countries. The SHAP values are computed in Python. Here we merge the sentence-level SHAP values with the respective sentences to recreate a "SHAP summary" plot.

```{r shap, fig.width=10, fig.height=7, code = readLines(here("R_scripts","AppendixA3.R"))}
```
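
The merging and ranking logic behind such a summary can be sketched in base `R` as follows. The data are simulated and all names (`shap_long`, `sentences`, `feature_a`, and so on) are hypothetical; the sketch ranks features by their mean absolute SHAP value, which is the usual ordering in SHAP summary plots, and is not the code in `AppendixA3.R`.

```r
set.seed(7)

# Simulated long-format SHAP values: one row per sentence-feature pair.
shap_long <- data.frame(
  sentence_id = rep(1:50, times = 4),
  feature     = rep(c("feature_a", "feature_b", "feature_c", "feature_d"),
                    each = 50),
  shap_value  = rnorm(200, sd = rep(c(0.8, 0.4, 0.2, 0.1), each = 50))
)

# Simulated sentences to attach to the SHAP values.
sentences <- data.frame(sentence_id = 1:50,
                        text = paste("hypothetical sentence", 1:50))

# Attach the sentence text to each SHAP value.
shap_text <- merge(shap_long, sentences, by = "sentence_id")

# Rank features by mean absolute SHAP value and keep at most the top 15.
shap_text$abs_shap <- abs(shap_text$shap_value)
importance <- aggregate(abs_shap ~ feature, data = shap_text, FUN = mean)
importance <- importance[order(-importance$abs_shap), ]
top_features <- head(importance, 15)
top_features
```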

## Appendix A4 (Populism and Thick Ideologies)

Appendix A4 demonstrates the relationship between thick ideologies and populism in the six countries. It is created using the following code (stored in `AppendixA4.R`).

```{r figA4, code = readLines(here("R_scripts","AppendixA4.R"))}
```

## Appendix A5 (Coding Error Austria)

The content of Appendix A5 was either created manually (Table A3) or included in the previous code chunks (Table A4 in `Table1.R` and Figure A7 in `AppendixA3.R`).

## Appendix A6 (Coding Error Germany)

Figure A8 in Appendix A6 is created using the following code (`AppendixA6.R`).

```{r figA6, fig.width=10, fig.height=4, code = readLines(here("R_scripts","AppendixA6.R"))}
```

## Appendix A7 (Removing Party Names - Germany)

Appendix A7 discusses the model performance for Germany when party names are removed. The code can be found in `AppendixA7.R`. Please note that Table A5 was created manually based on Python output.

```{r figA7, fig.width=10, fig.height=6, code = readLines(here("R_scripts","AppendixA7.R"))}
```

## Appendix A8 (Removing Party Names - Netherlands)

Appendix A8 is identical to Appendix A7 but it applies the analyses to the case of the Netherlands.

```{r figA8, fig.width=10, fig.height=6, code = readLines(here("R_scripts","AppendixA8.R"))}
```

## Appendix A9 (Reshuffling Results other Countries)

The content of Appendix A9 was already created when running `Figure2.R`.

## Appendix A10 (Top-50 Most Important Features)

In Appendix A10 we report the top-50 features based on the mean impurity approach. The code for this analysis is stored in `AppendixA10.R` and shown below.

```{r fig10, fig.width=10, fig.height=12, code = readLines(here("R_scripts","AppendixA10.R"))}
```

# Session Info

This notebook was run using the following setup. Note that the chunk below requires the `pander` package, which is not in the package list above:

```{r}
pander::pander(sessionInfo())
```
